## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
This dataset contains information about the quality of different variations of red wine. There are 1599 observations and 13 variables.
Since I have little knowledge of wine, I researched the variables in relation to their importance in wine quality:
Fixed acidity relates to the sourness of wines: wines from grapes grown in cooler climates are higher in fixed acidity and taste more sour, while wines from grapes grown in warmer climates are lower in acidity and are therefore milder. http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity
Volatile acidity refers to the acetic acid component of wine, the acid best known from vinegar. It is normally present only at low levels, and high levels give the wine a vinegary taste.
https://en.wikipedia.org/wiki/Acids_in_wine
Citric acid is present in grapes, and is seen as contributing to the ‘fresh’ taste of many wines. It occurs more frequently in white and rosé wines than in reds, so I would expect to see lower values of citric acid in this dataset. https://www.winefrog.com/definition/243/citric-acid
Residual sugar is the sugar content of the wine, which balances the acidity. https://drinks.seriouseats.com/2013/04/wine-jargon-what-is-residual-sugar-riesling-fermentation-steven-grubbs.html
Chlorides contribute to the saltiness of the wine, and are derived from the soil in which the grapes are grown. http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0101-20612015000100095
Free sulfur dioxide is the portion of SO2 that is not bound to other compounds in the wine, while total sulfur dioxide is the sum of the free and bound forms, including sulfites added by the winemaker to prevent the wine from going bad. Red wines usually have fewer added sulfites, so I would expect these two numbers to be fairly similar. https://winobrothers.com/2011/10/11/sulfur-dioxide-so2-in-wine/ https://www.practicalwinery.com/janfeb09/page5.htm
Density of wine is determined by the concentration of “…alcohol, sugar, glycerol, and other dissolved solids.” https://www.etslabs.com/analyses/DEN
pH measures how acidic the wine is (lower values are more acidic) and is described as an important indicator of a wine’s quality. http://winemakersacademy.com/importance-ph-wine-making/
Sulphates are used to preserve the flavor and freshness of wine. https://www.scientificamerican.com/article/myths-about-sulfites-and-wine/
Alcohol is the alcohol content of the wine. Most reds are between 12 and 15%. http://winefolly.com/tutorial/alcohol-content-in-wine/
Quality is the rating given to each wine; in this dataset the ratings range from 3 to 8.
Here we will conduct a preliminary exploration of the dataset.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
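The two outputs above have the shape of what `str()` and `summary()` print for a data frame. As a minimal sketch, the same calls can be tried on a small sample built from the first ten rows shown in the structure output. The name `wine_sample` is hypothetical, only a few columns are included, and density is left out because the output truncates it, so the numbers will not match the full 1599-row summary.

```r
# First ten rows only, copied from the structure output above.
# `wine_sample` is a hypothetical name used for illustration.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Column types and the first few values of each column
str(wine_sample)

# Minimum, quartiles, mean and maximum of each column
summary(wine_sample)
```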
This plot shows fixed acidity across the entire dataset, with the majority of wines clustering between about 7 and 10.
This is volatile acidity, with binwidth set at 0.1. Again, there is a large cluster between about 0.3 and 0.7, with a few outliers higher and lower.
This plot shows citric acid in the dataset. It resembles the first two graphs except for the large number of wines that have a very low level of citric acid, which is what we would expect from a dataset of red wines, since reds tend to be lower in citric acid than whites.
A plot of residual sugar, with most values clustering between 1 and 3 and the most frequently occurring value around 2.
In this plot of chlorides we see our first clear outlier, just above 0.6, while most of the data sits at or below 0.1.
A plot of free sulfur dioxides. Most wines look as though they’re fairly low in these, with a few notable exceptions above 60.
This plot resembles the free sulfur dioxide plot, and rightly so: free sulfur dioxide is counted as part of total sulfur dioxide. The outliers here are far higher than in the free sulfur dioxide plot, however.
Density is very similar for all of the wines in this dataset (it ranges only from about 0.990 to 1.004), so the binwidth is set to 0.0001 in order to see some differentiation. Perhaps density is very similar for all red wines.
pH is described as an important indicator of wine quality, so this is a variable that we will work with later on. Note the approximately normal distribution.
Sulphates show a pattern similar to the sulfur dioxide and acidity charts: a large cluster at the low end of the range, with a few outliers.
Alcohol: most wines are between 8% and 11%, with a few exceptions.
Quality- most of the wines are mid-range, with a quality of 5 or 6.
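The univariate plots above are histograms with a chosen binwidth. The sketch below shows one way such a plot might be drawn, assuming ggplot2 (the report does not say which plotting library was used). It uses only the ten volatile acidity values visible in the structure output, so the shape will not match the full plot; `wine_sample` and `p` are hypothetical names.

```r
library(ggplot2)  # assumption: the original plotting library is not stated

# Ten volatile acidity values copied from the structure output (not the full data)
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Histogram with bins 0.1 wide, the binwidth quoted for volatile acidity
p <- ggplot(wine_sample, aes(x = volatile.acidity)) +
  geom_histogram(binwidth = 0.1, colour = "white") +
  labs(x = "Volatile acidity", y = "Number of wines")

print(p)
```

A smaller binwidth, such as the 0.0001 used for density, only makes sense when the variable's whole range is narrow; otherwise most bins end up empty.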
The dataset contains 1599 wines, each described by 13 numeric variables (one of which, X, is just a row index). There are more mid-quality wines than higher or lower quality ones.
Quality of wine is most important to anyone trying to make an informed purchase- therefore, quality should be included in the analysis of this dataset.
investigation into your feature(s) of interest?
Features affecting the flavor of the wine, such as citric acid and residual sugar, should help. In addition, acidity and pH should affect the quality of the wine, so these should be examined as well.
No new variables have been created in the dataset.
No operations were performed to tidy the data.
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 4 variables:
## $ quality : int 3 4 5 6 7 8
## $ mean_pH : num 3.4 3.38 3.3 3.32 3.29 ...
## $ median_pH: num 3.39 3.37 3.3 3.32 3.28 3.23
## $ n : int 10 53 681 638 199 18
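I used dplyr to create a summary table called wine.pH, which groups the wines by quality and gives the mean pH, the median pH, and the number of wines at each quality level.
As we can see, pH is somewhat higher in the lower quality wines: the median falls from 3.39 at quality 3 to 3.23 at quality 8. Note that a higher pH means a less acidic wine, so this is a different effect from a wine that has turned vinegary, which would show up as higher volatile acidity.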
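A grouped summary with these four columns can be sketched with dplyr as follows. This is a minimal reconstruction, not necessarily the exact code behind the table above, and it runs on only the ten pH and quality values visible in the structure output (`wine_sample` is a hypothetical name), so it produces three quality groups rather than six.

```r
library(dplyr)

# Ten rows copied from the structure output (not the full 1599-row data)
wine_sample <- data.frame(
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One row per quality level: mean pH, median pH, and number of wines
wine.pH <- wine_sample %>%
  group_by(quality) %>%
  summarise(
    mean_pH   = mean(pH),
    median_pH = median(pH),
    n         = n()
  )

str(wine.pH)
```

Keeping `n` in the table is useful when reading the means: in the full data the lowest and highest quality levels have only 10 and 18 wines, so their averages are much less stable than those for quality 5 and 6.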
Here we can see again the relationship between pH and quality. Despite the graph that we created before with the mean pH levels, it becomes apparent here that pH for low-quality or high-quality wines can occur almost anywhere on the spectrum. Mid-quality wines (quality 5 and 6) do cluster together, with a pH between about 3.1 and 3.5.
This plot compares quality and volatile acidity. There does appear to be a trend here, with the highest volatile acidity found in the lower quality wines.
Here is a comparison between residual sugars and fixed acidity. There doesn’t seem to be any correlation between these variables.
Here is a comparison between residual sugars and citric acid. Again, there doesn’t seem to be any correlation between these variables.
Here is a comparison between citric acid and fixed acidity. There is clearly a high positive correlation, though the highest density of wines occurs where citric acid is either 0 or very close to 0.
In this plot of total against free sulfur dioxide we see a positive correlation, which is again not unexpected. The highest number of wines have low quantities of both total and free sulfur dioxide.
Surprisingly, fixed acidity and density appear to be positively correlated.
Alcohol content appears to have a very weak positive correlation with quality: the highest quality wines all have an alcohol content around 10% or above.
Here, the higher density wines on average have lower alcohol content. There is a very weak negative correlation apparent from this graph.
Quality ended up being a fairly uninformative variable- the more interesting comparisons are between pH, density, and volatile acidity.
There are several strong relationships apparent. The first is between free sulfur dioxide and total sulfur dioxide, which is to be expected, since free sulfur dioxide is counted within total sulfur dioxide. Another strong relationship is between density and fixed acidity. The final strongly positive relationship is between citric acid and fixed acidity.
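The relationships described here were judged from scatterplots. One way to put a number on them is a Pearson correlation, sketched below with base R's `cor()`. The sketch uses only the ten rows visible in the structure output (`wine_sample` and `cor_matrix` are hypothetical names), and density is left out because its values are truncated there, so the coefficients are illustrative and will differ from those of the full dataset.

```r
# Ten rows copied from the structure output (not the full 1599-row data)
wine_sample <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Pairwise Pearson correlations (the default method of cor())
cor_matrix <- cor(wine_sample)
print(round(cor_matrix, 2))

# Test for a single pair, with a confidence interval for the coefficient
print(cor.test(wine_sample$free.sulfur.dioxide,
               wine_sample$total.sulfur.dioxide))
```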
We can see that there are some patterns beginning to emerge here- the higher quality wines have higher fixed acidity, and slightly lower density. Mid quality wines have higher density, and lower fixed acidity. Interestingly, the poor quality wines seem to be distributed throughout.
Here is a much weaker pattern than above, but still apparent: Mid quality wines have lower fixed acidity and lower citric acid, while higher quality wines have higher citric acid and higher fixed acidity. However, the data point with the highest citric acid also happens to be lower quality- look at 1.00 on the X axis for the yellow point.
Here the only pattern is that the mid-quality wines have a wider range of total sulfur dioxides, while the higher quality wines have below about 100 total sulfur dioxides.
Here, clearly, higher density wines are poorer quality.
Density tended to be interesting, as it was able to differentiate between qualities of wines. Citric acid was also surprising- the higher the content of citric acid, the better quality wine.
I was surprised that quality was not clearly delineated in many of the plots. I expected much firmer stripes of color, and in many of the plots it’s impossible to see any clear patterns.
I chose this plot because it shows an approximately normal distribution of the data across levels of pH. It is interesting because it relates to quality: on average, higher pH goes with lower quality. In that sense it mirrors the graph made for quality, above, only reversed.
I chose to use the bivariate plot of quality and alcohol content, with some variations to make the plot more readable. This plot is interesting first because it shows a clear positive trend of higher content of alcohol in higher quality wines, especially when examining the median line added here.
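The median line mentioned here passes through the median alcohol content at each quality level. Those medians can be computed directly, as in the base R sketch below. It uses only the ten alcohol and quality values visible in the structure output (`wine_sample` and `median_alcohol` are hypothetical names), so it covers three quality levels and the values are illustrative only.

```r
# Ten rows copied from the structure output (not the full 1599-row data)
wine_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median alcohol content within each quality level
median_alcohol <- aggregate(alcohol ~ quality, data = wine_sample, FUN = median)
print(median_alcohol)
```

The median is a reasonable choice for this kind of line because it is less affected than the mean by the handful of unusually strong or weak wines at each quality level.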
This plot is interesting because it relates not only to the quality, but to the taste of the wine. Anyone concerned with this dataset for consumption purposes would find this vital- that the wines with higher levels of citric acid, or the ones described as lighter, fruitier, and crisper, would be of higher quality. I’ve also added a line so that it is possible to see the mean throughout.
Upon reflection, I am surprised to see that not many of these variables correlate with quality. For instance, I would have expected residual sugar or any of the acidities to correlate with quality, and that was not the case. Frequently while working with this data I wished that the set was much larger, so that it would be easier to see clearer trends. I was also surprised to see the number of mid-quality wines that shared almost the same characteristics as the higher quality wines, meaning that consumers who purchase very expensive bottles may be drinking a wine that is, in essence, the same as another less expensive wine.